We argue that Non-sequential Recursive Pair Substitution (NSRPS) as suggested by Jimenez-Montano and Ebeling can indeed be used as a basis for an optimal data compression algorithm. In particular, we prove for Markov sequences that NSRPS together with suitable codings of the substitutions and of the substitute series does not lead to a code length increase, in the limit of infinite sequence length. When applied to written English, NSRPS gives entropy estimates which are very close to those obtained by other methods. Using ca. 135 MB of input data from Project Gutenberg, we estimate the effective entropy to be ≈ 1.82 bit/character. Extrapolating to infinitely long input, the true value of the entropy is estimated as ≈ 0.8 bit/character.
I. INTRODUCTION

The discovery that the amount of information in a message (or in any other structure) can be objectively measured was certainly one of the major scientific achievements of the 20th century. On the theoretical side, this quantity - the information theoretic entropy - is of interest mainly because of its close relationship to thermodynamic entropy, its importance for chaotic systems, and its role in Bayesian inference (maximum entropy principle). Practically, estimating the entropy of a message (text document, picture, piece of music, etc.) is important because it measures its compressibility, i.e. the optimal achievement for any possible compression algorithm. In the following, we shall always deal with sequences $(s_0, s_1, \ldots)$ built from the characters of a finite alphabet $A = \{a_0, \ldots, a_{m-1}\}$ of size $m$. In the simplest case the alphabet consists just of 2 characters, in which case the maximum entropy is 1 bit per character.

Indeed, information entropy as introduced by Shannon [?] is a probabilistic concept. It requires a measure (probability distribution) to be defined on the set of all possible sequences. In particular, the probability for $s_t$ to be given by $a_k$, given all characters $s_0, s_1, \ldots, s_{t-1}$, is given by

$$p_t(k|k',k'',\ldots) = \mathrm{prob}(s_t = a_k \mid s_{t-1} = a_{k'}, s_{t-2} = a_{k''}, \ldots). \qquad (1)$$

In case of a stationary measure with finite range correlations, $p_t(k|k',k'',\ldots)$ becomes independent of $t$ for $t \to \infty$. Then Shannon's famous formula,

$$h = \lim_{\ell \to \infty} h^{(\ell)} \qquad (2)$$

with

$$h^{(\ell)} = -\sum_{k_1 \ldots k_\ell} p(k_1 \ldots k_\ell) \log_2 p(k_1|k_2 \ldots k_\ell), \qquad (3)$$

gives the average information per character. The generalization to non-stationary measures is straightforward but will not be discussed here.
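As an illustration of Eqs. (2) and (3), the following minimal sketch estimates $h^{(\ell)}$ from a finite sample by replacing probabilities with observed relative frequencies. It uses the identity $h^{(\ell)} = H_\ell - H_{\ell-1}$, where $H_\ell$ is the entropy of blocks of $\ell$ characters, which holds for stationary measures. The function names `block_entropy` and `conditional_entropy` are hypothetical, and no finite-sample correction is applied.

```python
from collections import Counter
from math import log2


def block_entropy(seq, length):
    """Plug-in estimate (in bits) of the entropy H_length of blocks of `length` symbols."""
    if length == 0:
        return 0.0
    n = len(seq) - length + 1  # number of (overlapping) blocks in the sample
    counts = Counter(tuple(seq[t:t + length]) for t in range(n))
    return -sum(c / n * log2(c / n) for c in counts.values())


def conditional_entropy(seq, order):
    """Estimate h^(order) of Eq. (3) as the block-entropy difference H_order - H_(order-1)."""
    return block_entropy(seq, order) - block_entropy(seq, order - 1)


if __name__ == "__main__":
    periodic = "01" * 1000
    # Single characters look random, but each character is fixed by its predecessor.
    print(conditional_entropy(periodic, 1), conditional_entropy(periodic, 2))
```

For the strictly periodic sequence in the usage example, the order-1 estimate is 1 bit while the order-2 estimate is essentially zero, which is the simplest case of correlations lowering $h^{(\ell)}$ as $\ell$ grows.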



In contrast to this approach are attempts to define the exact information content of a single finite sequence. Theoretically, the basic concept here is the algorithmic complexity AC (or algorithmic randomness) [?, 3]. For any given universal computer $U$, the AC of a sequence $S$ relative to $U$ is given by the length of the shortest program which, when input to $U$, prints $S$ and then makes $U$ stop, so that the next sequence can be read. If $S$ is randomly drawn from a stationary ensemble with entropy $h$, then one can show that the AC per character tends towards $h$, for almost all $S$ and all $U$, as the length of $S$ tends towards infinity [?]. Thus, except for rare sequences which do not contribute to averages, $h$ sets the limit for the compressibility.

Practically, the usefulness of AC is limited by the fact that there cannot exist any algorithm which finds for each $S$ its shortest code (such an algorithm could be used to solve Turing's halting problem, which is known to be impossible) [?]. But one can give algorithms which are often quite efficient. Huffman, arithmetic, and Lempel-Ziv coding are just three well known examples [5]. Any such algorithm can be used to give an upper bound to $h$ (modulo fluctuations from the finite sequence length) while, inversely, knowledge of $h$ sets a lower limit to the average code lengths possible with these codes.

A data compression scheme is called optimal if it does not do much worse than the best possible for typical random strings. More precisely, let $\{S\}$ be a set of sequences with entropy $h(S)$, and let the code string $C(S)$ be built from an alphabet of $m_C$ characters. Then we call the coding scheme $C : S \to C(S)$ optimal, if

$$\frac{\mathrm{length}[C(S)]}{\mathrm{length}[S]} \to \frac{h}{\log_2 m_C} \quad \text{for } \mathrm{length}[S] \to \infty \qquad (4)$$

and for nearly all $S$. While Huffman coding is not optimal, arithmetic and Lempel-Ziv codings are [?].

In several papers, Jimenez-Montano, Ebeling, and others [?] have suggested coding schemes by non-sequential recursive pair substitution (NSRPS) [?]. Call the original sequence $S_0$. We count the numbers $n_{jk}$ of non-overlapping successive pairs of characters in $S_0$ where $s_t = a_j$ and $s_{t+1} = a_k$, and find their maximum, $n_{\max} = \max_{j,k<m} n_{jk}$. The corresponding index pair is $(j_0, k_0)$. Then we introduce a new character by concatenation

$$a_m = (a_{j_0} a_{k_0}) \qquad (5)$$

and form the sequence $S_1$ by replacing everywhere the pair $a_{j_0} a_{k_0}$ by $a_m$. For the special case of $j_0 = k_0$, any string of $2r + 1$ characters $a_{j_0}$ is replaced by $r$ characters $a_m$, followed by one $a_{j_0}$.

This is then repeated recursively: The sequence $S_{i+1}$ is obtained from $S_i$ by replacing the most frequent pair $a_{j_i} a_{k_i}$ by a new character $a_{m+i}$. The procedure stops if one can argue that further replacements would not possibly be of any use. Typically this will happen if the code length consisting of both a description of $S_{i+1}$ and a description of the pair $(j_i, k_i)$ is definitely longer than a description of $S_i$, for the present and all subsequent $i$.
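The substitution procedure just described can be sketched in a few lines. This is a minimal illustration, not the implementation used for the results of this paper: the names `count_pairs`, `substitute` and `nsrps` are hypothetical, symbols are assumed to be the integers $0, \ldots, m-1$, ties between equally frequent pairs are broken arbitrarily, and the stopping criterion is replaced by a fixed number of steps.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping successive pairs (a run 'aaa' contains only one pair 'aa')."""
    counts = Counter()
    prev_counted = False  # True if an equal-symbol pair starting at t-1 was counted
    for t in range(len(seq) - 1):
        a, b = seq[t], seq[t + 1]
        if a == b and prev_counted:
            prev_counted = False  # this pair overlaps the one counted at t-1
            continue
        counts[(a, b)] += 1
        prev_counted = (a == b)
    return counts


def substitute(seq, pair, new_symbol):
    """Replace every occurrence of `pair`, scanning left to right, by `new_symbol`."""
    out, t = [], 0
    while t < len(seq):
        if t + 1 < len(seq) and (seq[t], seq[t + 1]) == pair:
            out.append(new_symbol)
            t += 2
        else:
            out.append(seq[t])
            t += 1
    return out


def nsrps(seq, steps):
    """Apply `steps` pair substitutions; return the final sequence and the rules used."""
    seq = list(seq)
    next_symbol = max(seq) + 1  # assumes integer symbols 0 .. m-1
    rules = []
    for _ in range(steps):
        counts = count_pairs(seq)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent pair (ties: first seen)
        seq = substitute(seq, pair, next_symbol)
        rules.append((next_symbol, pair))
        next_symbol += 1
    return seq, rules


if __name__ == "__main__":
    s0 = [int(c) for c in "00101001111010011011"]
    s1, rules = nsrps(s0, 1)
    print("".join(map(str, s1)), rules)
```

The left-to-right scan in `substitute` reproduces the rule for $j_0 = k_0$: a run of $2r + 1$ equal characters becomes $r$ new characters followed by one old one.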

Thus one sees that efficient encodings (which must also be uniquely decodable!) of the sequences $S_i$ and of the type of substituted pairs become crucial for the analysis of NSRPS. Unfortunately, the "codings" given in [?, ?] are neither efficient nor uniquely decodable. Thus their "complexities" have no direct relationship to $h$ or to algorithmic complexity (in contrast to their claim), and it is not clear from their work whether NSRPS can be made into an optimal coding scheme at all.

It is the purpose of the present paper to give at least partial answers to this. More precisely, we shall only be concerned with the limit of infinitely long strings, where the information encoded in the pairs $(j_i, k_i)$ can be neglected in comparison with the information stored in $S_i$, at least for any finite $i$. We will first show analytically that a coding scheme for $S_i$ exists which satisfies a necessary condition for optimality (Sec. 2). We then apply this to written English (Sec. 3), where we shall also compare our estimates of $h$ to those obtained with other methods.



II. NSRPS FOR MARKOV SEQUENCES 

Let us for the moment assume that $S_0$ is binary (the two characters are "0" and "1"), and that it is completely random, i.e. independent and identically distributed (iid) with the same probability for each character. Thus $p(0|\ldots) = p(1|\ldots) = 1/2$, and $h = 1$ bit. The length of $S_0$ is $N_0$, thus the total average information stored in $S_0$ is $N_0$ bits.

No coding scheme can reduce the length of $C(S_0)$ to less than $N_0$ bits on average. Indeed, all schemes will have $\mathrm{length}[C(S_0)] > N_0$ bits (strict inequality!), unless the "coding" is a verbatim copy. For a coding scheme to be optimal, a necessary (but not sufficient) condition is that

$$\mathrm{length}[C(S_0)]/N_0 \to 1 \text{ bit} \qquad (6)$$

for $N_0 \to \infty$, i.e. the overhead in the code must be less than extensive in the sequence length. This is what we want to show here, together with its generalization to arbitrary (first order) Markov sequences. For this, we need two lemmata:

Lemma 1: For any Markov sequence $S_0$ (not necessarily binary, and not necessarily iid) built from $m$ letters, the sequence $S_1$ is again Markov.

Lemma 2: If a word $w = (k, k', k'', \ldots)$ appears several times in $S_0$, and if one of these instances is substituted in $S_i$ by a string of characters not straddling its boundaries, then all other instances of $w$ in $S_0$ are also substituted in $S_i$ by the same string.

Lemma 1 tells us that NSRPS might make the structure of $S_i$ more complex than that of $S_0$, but not much so. Being a Markov chain, its entropy can be estimated if the transition probabilities $p(k|k')$ are known. Thus estimating the entropy of $S_i$ reduces to estimating di-block entropies $h^{(2)}$, which is straightforward (at least in the limit $N_0 \to \infty$).

Lemma 2 tells us that there cannot be any ambiguity in $S_i$. In particular, it cannot happen that more information is needed to specify $S_i$ than is needed to specify $S_0$, since the mapping $S_0 \to S_i$ is bijective, once the substitution rules are fixed.
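The invertibility of a single substitution step can be checked directly. The sketch below is only an illustration under simple assumptions: `substitute` and `expand` are hypothetical helper names, the substitution scans from left to right, and the new symbol is assumed not to occur in the input sequence.

```python
def substitute(seq, pair, new_symbol):
    """Replace every occurrence of `pair`, scanning left to right, by `new_symbol`."""
    out, t = [], 0
    while t < len(seq):
        if t + 1 < len(seq) and (seq[t], seq[t + 1]) == pair:
            out.append(new_symbol)
            t += 2
        else:
            out.append(seq[t])
            t += 1
    return out


def expand(seq, pair, new_symbol):
    """Invert `substitute`: write out `pair` wherever `new_symbol` appears."""
    out = []
    for s in seq:
        if s == new_symbol:
            out.extend(pair)
        else:
            out.append(s)
    return out


if __name__ == "__main__":
    s0 = [int(c) for c in "00101001111010011011"]
    s1 = substitute(s0, (0, 1), 2)
    print(s1, expand(s1, (0, 1), 2) == s0)
```

Undoing several substitutions amounts to calling `expand` with the recorded rules in reverse order, which is why only the list of pairs has to be transmitted in addition to the last $S_i$.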

The proofs of the lemmata are easy. Let us denote by $p_j(\ldots)$ the probability distributions after $j$ pair substitutions. For lemma 1 we just have to show that $p_1(k|k',k'')$ is independent of $k''$ for each pair $(k,k')$, provided the same holds also for $p_0$. This follows basically from the fact that any substitution makes the sequence shorter. But the detailed proof is somewhat tedious, because $p_1(k|k',k'') \neq p_0(k|k',k'')$, even if all $k$'s are less than $m$, $k \neq k_0$, $k'' \neq j_0$, and neither $(k,k')$ nor $(k',k'')$ are equal to the pair $(j_0,k_0)$. In that case, $(N_0 - n_{\max})\, p_1(k|k',k'') = N_0\, p_0(k|k',k'')$, and independence of $k''$ follows immediately. All other cases have to be dealt with similarly. For instance, if either $(k,k')$ or $(k',k'')$ is the pair $(j_0,k_0)$, then $p_1(k,k',k'') = 0$. Else, if $k'' = m \neq k,k'$, then $p_1(k|k',k'') = N_0/(N_0 - n_{\max})\, p_0(k|k',j_0,k_0) = N_0/(N_0 - n_{\max})\, p_0(k|k')$. We leave the other cases as exercises to the reader.

For proving lemma 2 we proceed indirectly. We assume that there is a word in $S_0$ which is encoded differently in different locations. Let us assume that this difference happened for the first time after $i$ substitutions. Since only one type of pair is exchanged in each step, this means that a substitution is skipped in one of the locations, at this step. But this is impossible, since all possible substitutions are made at each step.

From the two lemmata we obtain immediately our central

Theorem: If $S_0$ is drawn from a (first order) Markov process with length $N_0$ and entropy $h_0 = -\sum_{k,k'} p_0(k,k') \log_2 p_0(k|k')$, then every $S_i$ is also Markovian in the limit $N_0 \to \infty$, with entropy

$$h_i = h_i^{(2)} = -\sum_{k,k'} p_i(k,k') \log_2 p_i(k|k') \qquad (7)$$

and with length $N_i$ satisfying $N_i/N_0 = h_0/h_i$.
Thus the total amount of information needed to specify $S_i$ is the same as that for $S_0$, for infinitely long sequences. Since the overhead needed to specify the pairs $(j_i, k_i)$ can be neglected in this limit, we see that we do not lose code length efficiency by pair substitution, provided we take pair probabilities correctly into account during the coding. The actual encoding can be done by means of an arithmetic code based on the probabilities $p_i(k|k')$ [?], but we shall not work out the details. It is enough to know that the code length then becomes equal to the information (both measured in bits), for $N_0 \to \infty$.

Let us see in detail how all this works for completely random iid binary sequences. The original sequence $S_0 = 00101001111010011011\ldots$ has $p_0(00) = p_0(01) = p_0(10) = p_0(11) = 1/4$ and therefore $h = 1$ bit. Thus we can, without loss of generality, assume that the new character is 2 = (01), so that $S_1 = 02202111202121\ldots$. The 3 characters are now equiprobable, $p_1(0) = p_1(1) = p_1(2) = 1/3$, but they are not independent since of course $p_1(01) = 0$. Indeed, one finds $p_1(00) = p_1(02) = p_1(11) = p_1(21) = 1/6$, $p_1(10) = p_1(12) = p_1(20) = p_1(22) = 1/12$. The order-2 entropy of $S_1$ is easily calculated as $h_1^{(2)} = \frac{4}{3}\log_2 2$. On the other hand, since $N_0/4$ pairs have been replaced by single characters, the length of $S_1$ is $N_1 = 3N_0/4$. Thus, if $S_1$ is Markov, then the total information needed to specify it is $N_1 h_1^{(2)} = N_0$ bits, the same as for $S_0$. If it were not Markov, its information would be smaller. But this cannot be, because the map $S_0 \to S_1$ was invertible. Thus $S_1$ must indeed be Markov, as can also be checked explicitly.
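The arithmetic of this example can be verified mechanically. The sketch below takes the pair probabilities $p_1(kk')$ exactly as quoted in the text, computes $h_1^{(2)}$ as the difference between the pair entropy and the single-character entropy, and checks that $N_1 h_1^{(2)}/N_0 = 1$. The variable and function names are chosen here for illustration only.

```python
from fractions import Fraction
from math import log2

# Pair probabilities p_1(kk') of S_1 as quoted in the text, keyed in reading order.
p1_pair = {
    "00": Fraction(1, 6), "02": Fraction(1, 6), "11": Fraction(1, 6), "21": Fraction(1, 6),
    "10": Fraction(1, 12), "12": Fraction(1, 12), "20": Fraction(1, 12), "22": Fraction(1, 12),
}


def entropy(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)


# Single-character probabilities follow by summing over the second character.
p1_single = {}
for pair, p in p1_pair.items():
    p1_single[pair[0]] = p1_single.get(pair[0], Fraction(0)) + p

# Order-2 (conditional) entropy: pair entropy minus single-character entropy.
h1_2 = entropy(p1_pair.values()) - entropy(p1_single.values())
n1_over_n0 = Fraction(3, 4)  # N_0/4 pairs were replaced by single characters

if __name__ == "__main__":
    print(p1_single, h1_2, float(n1_over_n0) * h1_2)
```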

In the next step, we can either replace (21) → 3 or (02) → 3, since both have the same probability. If we do the former, the sequence becomes $S_2 = 02203112033\ldots$. Now the letters are no longer equiprobable, $p_2(1) = p_2(2) = p_2(3) = 1/5$, $p_2(0) = 2/5$. Calculating $N_2$, $p_2(kk')$, and $h_2^{(2)}$ is again straightforward, and one finds $N_2 h_2^{(2)} = N_0$ bits. Thus one concludes that $S_2$ must also be Markov. For the next few steps one can still verify

$$N_i h_i^{(2)} = N_0 \text{ bits} \qquad (8)$$

by hand, but this becomes increasingly tedious as $i$ increases.

Thus we have verified Eq. (8) by extensive simulations, where we found that it is exact, within the expected fluctuations, up to several thousand substitutions (Fig. 1). The distribution of the probabilities $p_i(k)$ becomes very wide for large $i$, i.e. the sequences $S_i$ are far from uniform for large $i$, but they are Markov and their entropies $h_i^{(2)}$ are exactly (within the expected systematic finite sample corrections [?]) equal to $N_0/N_i$ bits. Notice that if we encoded the last $S_i$ without taking the correlations into account (as seems suggested in [?]), then the code length for it would be larger and the coding scheme would not be optimal.
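A small-scale version of such a simulation can be sketched as follows. It is not the code used for Fig. 1: the sequence length, the number of steps, the seed and all function names are assumptions made here, entropies are plain plug-in estimates without finite-sample corrections, and ties between equally frequent pairs are broken arbitrarily. For each $S_i$ it returns the three code lengths per original character compared in the text: $\log_2(i+2)$ bits per character, a code based on the single-character distribution, and a code based on pair probabilities as in Eq. (8).

```python
import random
from collections import Counter
from math import log2


def most_frequent_pair(seq):
    """Most frequent pair, counting equal-symbol pairs without overlap."""
    counts, prev_counted = Counter(), False
    for a, b in zip(seq, seq[1:]):
        if a == b and prev_counted:
            prev_counted = False  # overlaps the pair counted one step earlier
            continue
        counts[(a, b)] += 1
        prev_counted = (a == b)
    return max(counts, key=counts.get)


def substitute(seq, pair, new_symbol):
    """Replace every occurrence of `pair`, scanning left to right, by `new_symbol`."""
    out, t, n = [], 0, len(seq)
    while t < n:
        if t + 1 < n and seq[t] == pair[0] and seq[t + 1] == pair[1]:
            out.append(new_symbol)
            t += 2
        else:
            out.append(seq[t])
            t += 1
    return out


def entropy(counts):
    """Plug-in entropy in bits of the empirical distribution given by `counts`."""
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())


def code_lengths(seq, n0, i):
    """Code lengths of S_i per original character for the three codings."""
    h1 = entropy(Counter(seq))                       # single-character entropy
    h2 = entropy(Counter(zip(seq, seq[1:]))) - h1    # order-2 (pair-based) entropy
    n_i = len(seq)
    return (n_i * log2(i + 2) / n0, n_i * h1 / n0, n_i * h2 / n0)


def simulate(n0=200_000, steps=5, seed=1):
    rng = random.Random(seed)
    seq = [rng.randrange(2) for _ in range(n0)]  # iid uniform binary S_0
    rows = []
    for i in range(1, steps + 1):
        seq = substitute(seq, most_frequent_pair(seq), i + 1)
        rows.append(code_lengths(seq, n0, i))
    return rows


if __name__ == "__main__":
    for i, row in enumerate(simulate(), start=1):
        print(i, ["%.4f" % x for x in row])
```

Only the pair-based code length is expected to stay close to 1 bit per original character; the other two ignore the correlations created by the substitutions and should come out larger.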

We have also made some simulations where we started with non-trivial Markov processes for $S_0$, or even with non-Markov sequences with known entropy. The latter were generated by creating initially a binary iid sequence with $p(0) \neq p(1)$, and then using this as an input configuration for a few iterations of the bijective cellular automaton R150 (in Wolfram's notation) [?].

From these simulations it seems that $N_i h_i^{(2)}$ always tends towards $N_0$. Also, the probability distributions $p_i(k)$ seem to tend (very slowly, see Fig. 2) to the same scaling limit as for iid and uniform $S_0$. This suggests that indeed $S_i$ tends to a Markov process for arbitrary $S_0$. In this case an optimal coding would be obtained if one used, e.g., an arithmetic code to encode $S_i$ by using approximate values of the observed $p_i(k|k')$ for large $i$.

FIG. 1: Results for a completely random (iid, uniformly distributed) binary initial sequence of $N_0 = 8 \times 10^8$ bits, plotted against the size of the extended alphabet. Uppermost curve: code length needed to encode $S_i$, divided by $N_0$, if $\log_2(i + 2)$ bits are used for each character. Middle curve: code length based on $h_i^{(1)}$, i.e. the single-character distributions $p_i(k)$ are used in the encoding. Lowest curve, indistinguishable on this scale from a horizontal straight line: code length based on $h_i^{(2)}$, using the two-character distributions $p_i(k,k')$.

FIG. 2: Ranked single character probability distributions $p_i(k)$ of strings after $i = 2298$ pair substitutions. The different curves are for a completely random iid initial string $S_0$ (solid line), iid string $S_0$ with $p_0(0) = 0.29$ (long dashed), $S_0$ obtained by applying two times CA rule 150 to an iid sequence with $p(0) = 0.09$ (dashed), and for written English with a reduced (46 character) alphabet (dotted).

Thus we have given strong (but still incomplete) arguments that NSRPS combined with efficient coding of $S_i$ gives indeed an optimal coding scheme. In practice, it would of course be extremely inefficient in terms of speed, and thus of no practical relevance. But it could well be that it might lead to more stringent entropy estimates than other methods. To test this we shall now turn to one of the most complex and interesting systems, written natural language.



III. THE ENTROPY OF WRITTEN ENGLISH 

The data used for the application of NSRPS to entropy estimation of written English consisted of ca. 150 MB of text taken from the Project Gutenberg homepage [20]. It includes mainly English and American novels from the 19th and early 20th century (Austen, Dickens, Galsworthy, Melville, Stevenson, etc.), but also some technical reports (e.g. Darwin, historical and sociological texts, etc.), Shakespeare's collected works, the King James Bible, and some novels translated from French and Russian (Verne, Tolstoy, Dostoevsky, etc.).

From these texts we first removed editorial and legal remarks added by the editors of Project Gutenberg. We also removed end-of-line, end-of-page, and carriage return characters. All runs of consecutive blanks were replaced by a single blank. Finally, we also removed all characters not in the 7-bit ASCII alphabet (ca. 4200 in total). These cleaned texts were then concatenated to form one big input string of 148,214,028 characters.

Entropies were estimated both from this string (which still contained upper and lower case letters, numbers, all kinds of brackets and punctuation marks, 95 different characters in total), and from a version with reduced alphabet. In the latter, we changed all letters to upper case; all brackets to either ( or ); the symbols $, #, &, *, %, @ to one single symbol; colons, exclamation and question marks to points; quotation marks to apostrophes; and semicolons to commas. This reduced alphabet then had 46 letters (including, of course, the blank "␣").
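The preprocessing described above can be sketched as follows. This is a reading of the textual description, not the original preprocessing code: the function names, the choice of "#" as the merged symbol, and the treatment of only square and curly brackets as "brackets" are assumptions, and the sketch is not claimed to reproduce the exact 46-letter alphabet.

```python
import re

# Character merges for the reduced alphabet (assumed details are marked).
_REDUCE = str.maketrans({
    "[": "(", "{": "(", "]": ")", "}": ")",            # brackets (assumed set)
    "$": "#", "&": "#", "*": "#", "%": "#", "@": "#",   # merged into one symbol (assumed: "#")
    ":": ".", "!": ".", "?": ".",                       # colons, exclamation and question marks
    '"': "'",                                           # quotation marks to apostrophes
    ";": ",",                                           # semicolons to commas
})


def clean_text(raw):
    """Apply the cleaning steps in the order given in the text."""
    text = raw.replace("\r", "").replace("\n", "").replace("\f", "")  # line and page breaks
    text = re.sub(" +", " ", text)                                     # runs of blanks
    return "".join(c for c in text if 32 <= ord(c) <= 126)            # printable 7-bit ASCII only


def reduce_alphabet(text):
    """Map cleaned text onto the reduced alphabet."""
    return text.upper().translate(_REDUCE)


if __name__ == "__main__":
    sample = 'He said:\r\n  "Why not?"   {see [1]; $5 & up}'
    print(reduce_alphabet(clean_text(sample)))
```

The range 32 to 126 contains exactly the 95 printable ASCII characters, matching the alphabet size quoted for the unreduced string.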

The most frequent pair of letters in English is "e␣". After replacing it by a new "letter", the next pair to substitute is "␣t", then "␣a", "␣th", etc. Very soon also longer strings are substituted, e.g. after 92 steps appears the first two-word combination, "of␣the␣".

As long as the number of new symbols is still small, it is easy to estimate the pair probabilities, and from this an upper bound $h_i = h_i^{(2)} N_i/N_0$ on the entropy. This becomes more and more difficult as the alphabet size increases, as the sampling becomes insufficient even with our very long input file, and we can no longer approximate the $p_i(k,k')$ by the observed relative frequencies. As long as the number of different subsequent pairs is much smaller than the sequence length (i.e., most pairs are observed many times), we can still get reliable estimates of $h_i$ by using the leading correction term discussed in [?]. But finally, when many pairs are seen only once in the entire text, we have to stop since any estimate of $h_i^{(2)}$ becomes unreliable.

FIG. 3: Entropy estimates $h$ from pair probabilities plotted against the size of the extended alphabet. Upper curve is for the initial 7 bit alphabet, including upper and lower case letters. The lower curve is for the reduced (46 letter) initial alphabet. The smooth dotted line passing through the lower data set is a fit with Eq. (9).

We went up to 6000 substitutions. The longest substrings substituted by a single new symbol had length 13 in the original (95 letter) alphabet, and length 16 in the reduced (46 letter) one (the latter was "would␣have␣been␣"). The entropies $h$ per (original) character are plotted in Fig. 3. We see that they are very similar for both alphabets. We find $h \approx 1.8$ bits/character after 6000 substitutions. This number is very close to the value obtained from most other methods (with the exception of [14], where ≈ 1.5 bits/character were obtained), if one uses 10 - 100 MB of input text [?]. This is surprising in view of two facts. First of all, the methods applied in [?, 17] are very different, and one might have thought a priori that they are able to use different structures of the language to achieve high compression rates. Apparently they do not.

Secondly, it is clear that $h \approx 1.8$ bits/character is not a realistic estimate of the true entropy of written English. Even though we cannot, with our present text lengths and our computational resources, go to much larger alphabet sizes (i.e. to more substitutions), it is clear from Fig. 3 that both curves would continue to decrease. Let us denote by $i$ the number of substitutions. Then empirical fits to both curves in Fig. 3 are given by

$$h_i = h + \frac{c}{(i + i_0)^{\alpha}}. \qquad (9)$$

Such a fit to the 46 letter data, with $h = 0.7$, $i_0 = 34$, $c = 4.99$, and $\alpha = 0.1745$, is also shown in Fig. 3. One should of course not take it too seriously in view of the very slow convergence with $i$ and the very long extrapolation, but it suggests that the true entropy of written English is 0.7 ± 0.2 bits/character.
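To get a feeling for how slow this convergence is, one can evaluate a power-law fit of the form $h + c/(i + i_0)^{\alpha}$ with the parameter values quoted above. The sketch below does only that; the function names `h_fit` and `substitutions_needed` are introduced here for illustration, and the numbers it produces are extrapolations of the fit, not measurements.

```python
def h_fit(i, h=0.7, c=4.99, i0=34, alpha=0.1745):
    """Fitted entropy estimate (bits/character) after i substitutions."""
    return h + c / (i + i0) ** alpha


def substitutions_needed(eps, c=4.99, i0=34, alpha=0.1745):
    """Number of substitutions at which the fit comes within eps of its asymptote h."""
    return (c / eps) ** (1.0 / alpha) - i0


if __name__ == "__main__":
    for i in (0, 100, 6000, 10**6, 10**9):
        print(i, round(h_fit(i), 3))
    print("substitutions to get within 0.1 bit of h: %.2e" % substitutions_needed(0.1))
```

At $i = 6000$ the fit gives about 1.79 bits/character, consistent with the value of ≈ 1.8 quoted above, while coming within 0.1 bit of the asymptote would take several billion substitutions according to the same fit.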

This estimate is somewhat lower than the estimate of [?] and the extrapolations given in [17]. It is comparable with that of [18] and with Shannon's original estimate [19]. It seems definitely to exclude the possibility $h = \ldots$ which was proposed in [?].
IV. CONCLUSIONS

We have shown how a strategy of non-sequential replacements of pairs of characters can yield efficient data compression and entropy estimates. A similar strategy was first proposed by Jimenez-Montano and others, but details and the actual coding done in the present paper are quite different from those proposed in [?]. Indeed, this strategy was never used in [?] for actual codings, and it was also not used for realistic entropy estimates.

Compared to conventional sequential codes (such as Lempel-Ziv or arithmetic codes [?], just to mention two), the present method would be much slower. Instead of a single pass through the data as in sequential coding schemes, we went up to 6000 times through the data file, in order to achieve a high compression rate. We could of course do with far fewer passes if we were content with compression rates comparable to those of commercial packages such as "zip" or "compress". For written English these typically achieve compression factors ≈ 2.6, i.e. ca. 3 bits/character. As seen from Fig. 3, this can be achieved by NSRPS very easily with very few passes, but even then the overhead and the computational complexity of NSRPS are much too high to make it a practical alternative.

NSRPS can be seen as a greedy and extremely simple version of off-line textual substitution [21]. In combination with other sophisticated techniques, similar substitutions can give excellent results [?]. But without these techniques, it is in general believed that only much more sophisticated versions of off-line textual substitution are of any interest [21]. Again this is presumably true as far as practical coding schemes are concerned. But things seem to be different if one is interested in entropy estimation. Here the present method is much simpler (even though computationally more demanding) than the tree-based gambling algorithms [17] that had given the best results up to now. Without extrapolation, it gives the same (upper bound) estimates as these methods. But it seems that it allows a more reliable extrapolation to infinite text length and infinite substitution depth, and thus a more reliable estimate of the true asymptotic entropy.

From the mathematical point of view, we should however stress that we have only partial results. While we have proven that the Markov structure is a fixed point of the substitution, we have not proven that it is attractive. We thus cannot prove that the present strategy is indeed universally optimal, although we believe that our numerical results strongly support this conjecture. A rigorous proof would of course be extremely welcome.

I thank Ralf Andrzejak, Hsiao-Ping Hsu, and Walter Nadler for carefully reading the manuscript and for useful discussions.